NSF PAR Search | NSF Public Access Repository

A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability

https://doi.org/10.1145/3774934.3786453

Ma, Yuchen; Stathopoulos, Andreas; Ren, Bin (January 2026, Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, January 31 to February 4, 2026)

Distributed matrix-block-vector multiplication (Matvec) is a critical component of many applications, but can be computationally challenging for dense matrices of dimension \(O(10^6\text{--}10^7)\) and blocks of \(O(10\text{--}100)\) vectors. We present performance analysis, implementation, and optimization of our library for Matvec under the effect of system variability. Our modeling shows that 1D pipelined Matvec is as efficient as 2D algorithms on small to medium clusters, which are sufficient for these problem sizes. We develop a performance tracing framework and a simulator that reveal pipeline bubbles caused by modest ~5% system variability. To tolerate such variability, our library, which combines on-the-fly kernel matrix generation and Matvec, integrates four optimizations: inter-process data preloading, unconventional static thread scheduling, cache-aware tiling, and multi-version unrolling. In our benchmarks on \(O(10^5)\) Matvec problems, the library achieves up to 1.85× speedup over COSMA and 17× over ScaLAPACK. For \(O(10^6)\) problems, where COSMA and ScaLAPACK exceed memory capacity, the library maintains linear strong scaling and achieves peak performance of 75% FMA Flop/s. Its static scheduling policy delivers a 2.27× speedup over the conventional work-stealing dynamic scheduler, and is predicted to withstand up to 108% performance variability in simulations with exponentially distributed variability.
Full Text Available
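
As a rough illustration of the idea named in the abstract above, combining on-the-fly kernel matrix generation with a tiled matrix-block-vector product, the following single-process sketch computes Y = K X without ever storing K. It is not the authors' library: the Gaussian kernel, the function names `gaussian_kernel` and `tiled_kernel_matvec`, and the tile size are all assumptions made for this example, and nothing here is distributed or pipelined.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Illustrative kernel only; the abstract does not say which kernel is used."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def tiled_kernel_matvec(points, X, tile=4):
    """Compute Y = K X with K[i][j] = gaussian_kernel(points[i], points[j]).

    K is generated entry by entry inside each (tile x tile) block and is
    never stored, so memory use stays at O(n * b) for n points and b vectors.
    """
    n, b = len(points), len(X[0])
    Y = [[0.0] * b for _ in range(n)]
    for i0 in range(0, n, tile):          # row tiles of K
        for j0 in range(0, n, tile):      # column tiles of K
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    kij = gaussian_kernel(points[i], points[j])
                    for c in range(b):    # apply one K entry to the whole block of vectors
                        Y[i][c] += kij * X[j][c]
    return Y

# Hypothetical toy input: 10 points in 2D and a block of 2 vectors.
points = [(0.1 * i, 0.2 * i) for i in range(10)]
X = [[1.0, float(i)] for i in range(10)]
Y = tiled_kernel_matvec(points, X, tile=4)
```

Each generated kernel entry is reused across all vectors in the block, which is why block-vector products amortize the cost of regenerating K; a real implementation would replace the inner loops with vectorized or unrolled kernels.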
Generalizing Reuse Patterns for Efficient DNN on Microcontrollers

https://doi.org/10.1145/3676641.3716257

Liu, Jiesong; Ren, Bin; Shen, Xipeng (March 2025, ACM)

Full Text Available
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially if both real-time execution and high inference accuracy are in demand, because the increasingly large model size and complex model structure of 3D CNNs usually require tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning is either not compatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design called Mobile-3DCNN that consists of two parts: a novel, fine-grained structured pruning enhanced by a prune/Winograd adaptive selection (that is mobile-hardware-friendly and can achieve high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by our pruning (to fully transform the pruning benefit into real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, Alibaba Mobile Neural Networks and PyTorch-Mobile, with speedup up to 34× and minor accuracy degradation, proving it is possible to execute high-accuracy large 3D CNNs on mobile devices in real-time (or even ultra-real-time).
Full Text Available
Reducing Unfairness in Distributed Community Detection

https://doi.org/10.1109/ICDM59182.2024.00121

Zhang, Hao; Jayaweera, Malith; Ren, Bin; Wang, Yanzhi; Soundarajan, Sucheta (December 2024, IEEE)

Full Text Available
TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations

https://doi.org/10.1145/3721145.3725774

Guan, Jiexiong; Hu, Zhenqing; Antonopoulos, Christos D; Bellas, Nikolaos; Lalis, Spyros; Smirni, Evgenia; Zhou, Gang; Agrawal, Gagan; Ren, Bin (June 2025, ACM)

The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile systems-on-a-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a poorly documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48× to 3.61× speedup on individual kernels and 1.83× to 66.1× speedup for end-to-end on-device training with only 0.25% to 18.5% of the tuning cost of the baselines.
Full Text Available
SoD2: Statically Optimizing Dynamic Neural Network Execution

Niu, Wei; Agrawal, Gagan; Ren, Bin (May 2024, ACM)

Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used depend on the input and/or execution, are becoming common. This paper presents SoD2, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of the common operators that form DNNs, and the use of this classification in a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP, we enable a series of optimizations, such as fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate reductions in both execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains.
Full Text Available
Towards Recognizing Food Types for Unseen Subjects

https://doi.org/10.1145/3696424

Guan, Jiexiong; Wang, Junjie; Niu, Wei; Peng, Zhen; Wang, Shuangquan; Liu, Zhenming; Zhou, Gang; Ren, Bin (September 2024, ACM Transactions on Computing for Healthcare)

Recognizing food types through sensor signals for unseen users remains remarkably challenging, despite extensive recent studies. The efficacy of prior machine learning techniques is dwarfed by giant variations of data collected from multiple participants, partly because users have varied chewing habits and wear sensor devices in various manners. This work treats the problem as an instance of the domain adaptation problem, where each user represents a domain. We develop the first multi-source domain adaptation (MSDA) method for food-typing recognition, which consists of three major components: stratified normalization, a multi-source domain adaptor, and adaptive ensemble learning. New techniques are developed for each component. Using a real-world dataset comprised of 15 participants, we demonstrate that our method achieves \(1.33\times\) to \(2.13\times\) improvement in accuracy compared with nine state-of-the-art MSDA baselines. Additionally, we perform an in-depth ablation study to examine the behavior of each component and confirm their efficacy.
Full Text Available
SoD²: Statically Optimizing Dynamic Deep Neural Network Execution

https://doi.org/10.1145/3617232.3624869

Niu, Wei; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
DEFCON: Deformable Convolutions Leveraging Interval Search and GPU Texture Hardware

https://doi.org/10.1109/IPDPS57955.2024.00063

Jayaweera, Malith; Li, Yanyu; Wang, Yanzhi; Ren, Bin; Kaeli, David (May 2024, IEEE)

Full Text Available
On Item-Sampling Evaluation for Recommender System

https://doi.org/10.1145/3629171

Li, Dong; Jin, Ruoming; Liu, Zhenming; Ren, Bin; Gao, Jing; Liu, Zhi (March 2024, ACM Transactions on Recommender Systems)

Personalized recommender systems play a crucial role in modern society, especially in e-commerce, news, and advertising. Correctly evaluating and comparing candidate recommendation models is as essential as constructing them. The common offline evaluation strategy is to hold out some user-interacted items from the training data and evaluate recommendation models based on how many of those items they can retrieve. Specifically, for any hold-out item (the so-called target item) for a user, the recommendation model tries to predict the probability that the user would interact with the item and ranks it among all items, which is called global evaluation. Intuitively, a good recommendation model would assign high probabilities to such hold-out/target items. Based on the specific ranks, metrics like Recall@K and NDCG@K can be calculated to further quantify the quality of the recommender model. Instead of ranking the target items among all items, Koren first proposed to rank them among a small sampled set of items and then quantify the performance of the models, which is called sampling evaluation. Since then, a large amount of work has adopted sampling evaluation due to its efficiency and frugality. In recent work, Rendle and Krichene argued that sampling evaluation is “inconsistent” with respect to global evaluation in terms of offline top-K metrics. In this work, we first investigate the “inconsistent” phenomenon by examining the connections between sampling evaluation and global evaluation. We reveal an approximately linear relationship between sampling evaluation and its global counterpart in terms of the top-K Recall metric. Second, we propose a new statistical perspective on sampling evaluation: estimating the global rank distribution of the entire population. After the estimated rank distribution is obtained, the approximation of the global metric can be further derived. Third, we extend the work of Krichene and Rendle, directly optimizing the error with ground truth, providing not only a comprehensive empirical study but also a rigorous theoretical understanding of the proposed metric estimators. To address the “blind spot” issue, where accurately estimating metrics for small top-K values in sampling evaluation is challenging, we propose a novel adaptive sampling method that generalizes the expectation-maximization algorithm to this setting. Last but not least, we also study the effect of user sampling evaluation. This series of works outlines a clear roadmap for sampling evaluation and establishes a foundational theoretical framework. Extensive empirical studies validate the reliability of the sampling methods presented.
Full Text Available
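
To make the distinction between global evaluation and sampling evaluation in the abstract above concrete, here is a minimal sketch that ranks each user's target item once among all items and once among a small random sample of non-target items, then compares Recall@K. The synthetic scores, the sample size of 99, and the function names are assumptions for illustration; this is not the estimator proposed in the paper.

```python
import random

def recall_at_k(ranks, k):
    """Fraction of target items whose 1-based rank is at most k."""
    return sum(r <= k for r in ranks) / len(ranks)

def global_rank(user_scores, target):
    """Rank of the target among all items (1 = best)."""
    t = user_scores[target]
    return 1 + sum(s > t for s in user_scores)

def sampled_rank(user_scores, target, n_samples, rng):
    """Rank of the target among n_samples randomly drawn non-target items."""
    negatives = [i for i in range(len(user_scores)) if i != target]
    sample = rng.sample(negatives, n_samples)
    t = user_scores[target]
    return 1 + sum(user_scores[i] > t for i in sample)

rng = random.Random(0)
n_users, n_items = 200, 500
global_ranks, sampled_ranks = [], []
for _ in range(n_users):
    # Synthetic model scores; the target gets a boost so it tends to rank well.
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    target = rng.randrange(n_items)
    scores[target] += 2.0
    global_ranks.append(global_rank(scores, target))
    sampled_ranks.append(sampled_rank(scores, target, 99, rng))

for k in (1, 5, 10):
    print(k, recall_at_k(global_ranks, k), recall_at_k(sampled_ranks, k))
```

Because the sampled negatives are a subset of all items, a target's sampled rank can never be worse than its global rank, so sampled Recall@K is at least the global Recall@K for the same K. That gap is why the two metrics cannot be compared directly and why mapping sampled results back to global ones requires estimation.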